secret word
Eliciting Secret Knowledge from Language Models
Cywiński, Bartosz, Ryd, Emil, Wang, Rowan, Rajamanoharan, Senthooran, Nanda, Neel, Conmy, Arthur, Marks, Samuel
Model Organisms (MOs) research involves intentionally training models to exhibit specific failure modes, so that they can serve as a testbed for studying those failures and developing mitigations (Hubinger et al., 2024; Denison et al., 2024; Marks et al., 2025). Prior work has introduced several types of MOs, including models that conceal capabilities unless a specific trigger is present in the input (Greenblatt et al., 2024b; van der Weij et al., 2025), that fake alignment to evade safety measures (Greenblatt et al., 2024a), and that display broad misalignment after being fine-tuned on a narrow distribution of harmful data (Betley et al., 2025). The secret-keeping models trained in this work represent a novel class of MOs that refrain from revealing that they have certain factual knowledge.

Auditing Language Models. Our work contributes to the growing field of alignment auditing, which aims to systematically investigate whether a model pursues undesired or hidden objectives, rather than merely evaluating its surface-level behavior (Casper et al., 2024). A central methodology for validating such audits is to construct a testbed with a known ground truth, a principle applied in prior work (Schwettmann et al., 2023; Rager et al., 2025).
- Oceania > Australia (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > Poland > Masovia Province > Warsaw (0.04)
- Health & Medicine (1.00)
- Education > Health & Safety > School Nutrition (0.46)
Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs
Badola, Kartikeya, Simon, Jonathan, Hosseini, Arian, McCarthy, Sara Marie, Munkhdalai, Tsendsuren, Goyal, Abhimanyu, Kočiský, Tomáš, Upadhyay, Shyam, Fatemi, Bahare, Kazemi, Mehran
Large language models (LLMs) excel at solving problems with clear and complete statements, but often struggle with the nuanced environments and interactive tasks that are common in real-world scenarios. This highlights the critical need for LLMs that can engage in logically consistent multi-turn dialogue, seek information, and reason with incomplete data. To this end, we introduce a novel benchmark comprising a suite of multi-turn tasks, each designed to test specific reasoning, interactive dialogue, and information-seeking abilities. These tasks have deterministic scoring mechanisms, eliminating the need for human intervention. Evaluating frontier models on our benchmark reveals significant headroom. Our analysis shows that most errors stem from poor instruction following, reasoning failures, and poor planning. This benchmark provides valuable insights into the strengths and weaknesses of current LLMs in handling complex, interactive scenarios and offers a robust platform for future research aimed at improving these critical capabilities.
- North America (0.28)
- Asia (0.28)
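The deterministic, human-free scoring that the abstract above emphasizes is easy to picture. The sketch below is an invented example in that spirit, not the benchmark's actual task or scorer: a 20-questions-style episode is graded purely from the transcript and a known target.

```python
# Hedged sketch of a deterministic multi-turn scorer; the 20-questions task
# and the scoring rule are illustrative assumptions, not the paper's.
from dataclasses import dataclass

@dataclass
class Turn:
    question: str  # model's question
    answer: str    # scripted oracle reply, e.g. "yes" / "no"

def score_episode(turns: list[Turn], final_guess: str, target: str,
                  max_turns: int = 20) -> float:
    """1.0 minus a turn penalty for a correct guess; 0.0 otherwise.
    Computed from the transcript alone, so no human judging is needed."""
    if final_guess.strip().lower() != target.lower():
        return 0.0
    return 1.0 - 0.5 * (len(turns) / max_turns)  # fewer questions, higher score

turns = [Turn("Is it alive?", "yes"), Turn("Is it a plant?", "no"),
         Turn("Is it a mammal?", "yes"), Turn("Is it domesticated?", "yes")]
print(score_episode(turns, "dog", "dog"))  # 0.9
```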
How Far Can LLMs Improve from Experience? Measuring Test-Time Learning Ability in LLMs with Human Comparison
Wang, Jiayin, Guo, Zhiqiang, Ma, Weizhi, Zhang, Min
As evaluation designs for large language models may shape our trajectory toward artificial general intelligence, comprehensive and forward-looking assessment is essential. Existing benchmarks primarily assess static knowledge, while intelligence also entails the ability to learn rapidly from experience. To this end, we advocate for evaluating test-time learning: the capacity to improve performance on experience-based, reasoning-intensive tasks during test time. In this work, we propose semantic games as effective testbeds for evaluating test-time learning, due to their resistance to saturation and inherent demand for strategic reasoning. We introduce an objective evaluation framework that compares model performance under both limited and cumulative experience settings and covers four forms of experience representation. To provide a comparative baseline, we recruit eight human participants to complete the same tasks. Results show that LLMs exhibit measurable test-time learning capabilities; however, their improvements are less stable under cumulative experience and progress more slowly than those observed in humans. These findings underscore the potential of LLMs as general-purpose learning machines, while also revealing a substantial intellectual gap between models and humans, irrespective of how well LLMs perform on static benchmarks.
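The limited-versus-cumulative comparison described above can be pictured as two evaluation loops that differ only in whether past transcripts are carried forward. The sketch below is an assumed interface, not the paper's code; `play_episode` is a hypothetical model API and the prompt format is a placeholder.

```python
# Hedged sketch of a limited vs. cumulative experience comparison.
# `model.play_episode` is a hypothetical API returning (transcript, score).
def build_prompt(task: str, history: list[str], cumulative: bool) -> str:
    experience = "\n\n".join(history) if cumulative else ""
    return f"{experience}\n\nNew game:\n{task}".strip()

def evaluate(model, episodes: list[str], cumulative: bool) -> list[float]:
    history, scores = [], []
    for task in episodes:
        transcript, score = model.play_episode(
            build_prompt(task, history, cumulative))
        history.append(transcript)  # experience accrues across episodes
        scores.append(score)
    return scores  # test-time learning shows up as an upward score trajectory
```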
CoMet: Metaphor-Driven Covert Communication for Multi-Agent Language Games
Metaphors are a crucial way for humans to express complex or subtle ideas by comparing one concept to another, often from a different domain. However, many large language models (LLMs) struggle to interpret and apply metaphors in multi-agent language games, hindering their ability to engage in covert communication and semantic evasion, which are crucial for strategic communication. To address this challenge, we introduce CoMet, a framework that enables LLM-based agents to engage in metaphor processing. CoMet combines a hypothesis-based metaphor reasoner with a metaphor generator that improves through self-reflection and knowledge integration. This enhances the agents' ability to interpret and apply metaphors, improving the strategic and nuanced quality of their interactions. We evaluate CoMet on two multi-agent language games, Undercover and Adversarial Taboo, which emphasize covert communication and semantic evasion. Experimental results demonstrate that CoMet significantly enhances the agents' ability to communicate strategically using metaphors.
- Asia (0.46)
- North America > Mexico (0.28)
- Leisure & Entertainment > Games (1.00)
- Government (0.67)
Towards eliciting latent knowledge from LLMs with mechanistic interpretability
Cywiński, Bartosz, Ryd, Emil, Rajamanoharan, Senthooran, Nanda, Neel
As language models become more powerful and sophisticated, it is crucial that they remain trustworthy and reliable. There is concerning preliminary evidence that models may attempt to deceive or keep secrets from their operators. To explore the ability of current techniques to elicit such hidden knowledge, we train a Taboo model: a language model that describes a specific secret word without explicitly stating it. Importantly, the secret word is not presented to the model in its training data or prompt. We then investigate methods to uncover this secret. First, we evaluate non-interpretability (black-box) approaches. Subsequently, we develop largely automated strategies based on mechanistic interpretability techniques, including logit lens and sparse autoencoders. Evaluation shows that both approaches are effective in eliciting the secret word in our proof-of-concept setting. Our findings highlight the promise of these approaches for eliciting hidden knowledge and suggest several promising avenues for future work, including testing and refining these methods on more complex model organisms. This work aims to be a step towards addressing the crucial problem of eliciting secret knowledge from language models, thereby contributing to their safe and reliable deployment.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > Poland > Masovia Province > Warsaw (0.04)
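The logit-lens technique mentioned in the abstract above is simple enough to sketch. The snippet below is a minimal illustration, not the paper's implementation: it decodes each layer's residual stream through the model's unembedding and prints the top tokens, the idea being that a secret word may surface at intermediate layers before the model suppresses it. The model name and prompt are placeholders, and the final-norm attribute path assumes a Llama/Gemma-style architecture.

```python
# Minimal logit-lens sketch (assumptions: a Hugging Face causal LM with a
# Llama/Gemma-style `model.model.norm`; model name and prompt are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-9b-it"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Give me a hint about your secret word."
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

final_norm = model.model.norm  # apply the final norm for a faithful logit lens
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(final_norm(h[:, -1]))  # decode the last position
    top = logits[0].topk(5).indices.tolist()
    print(f"layer {layer:2d}:", tok.convert_ids_to_tokens(top))
```

Treat this as a starting point: the paper's largely automated pipeline, and its sparse-autoencoder variant, go well beyond this single-prompt inspection.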
Do LLMs Strategically Reveal, Conceal, and Infer Information? A Theoretical and Empirical Analysis in The Chameleon Game
Karabag, Mustafa O., Topcu, Ufuk
Large language model-based (LLM-based) agents have become common in settings that include non-cooperative parties. In such settings, agents' decision-making needs to conceal information from their adversaries, reveal information to their cooperators, and infer information to identify the other agents' characteristics. To investigate whether LLMs have these information-control and decision-making capabilities, we have LLM agents play the language-based hidden-identity game The Chameleon. In the game, a group of non-chameleon agents who do not know each other aim to identify the chameleon agent without revealing a secret. The game requires the aforementioned information-control capabilities both as a chameleon and as a non-chameleon. The empirical results show that while non-chameleon LLM agents identify the chameleon, they fail to conceal the secret from it, and their winning probability falls far short of what even trivial strategies achieve. To formally explain this behavior, we give a theoretical analysis for a spectrum of strategies, from concealing to revealing, and provide bounds on the non-chameleons' winning probability. Based on the empirical results and the theoretical analysis of different strategies, we deduce that LLM-based non-chameleon agents reveal excessive information to agents of unknown identities. Our results point to a weakness of contemporary LLMs, including GPT-4, GPT-4o, Gemini 1.5, and Claude 3.5 Sonnet, in strategic interactions.
- Leisure & Entertainment > Games (1.00)
- Information Technology > Security & Privacy (0.68)
- Leisure & Entertainment > Sports (0.67)
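The conceal-reveal trade-off that the paper above analyzes formally can be illustrated with a toy Monte-Carlo model. Everything below is an assumption made for illustration, not the paper's model: non-chameleons speak with "revealingness" p, which makes their vote against the chameleon more accurate but also leaks the secret more often.

```python
# Toy Monte-Carlo model of the conceal-reveal trade-off; all functional forms
# and constants are assumptions for illustration, not the paper's analysis.
import random

def non_chameleons_win(p: float) -> bool:
    """One simulated round at 'revealingness' level p in [0, 1]."""
    secret_leaked = random.random() < p               # leak grows with p
    vote_correct = random.random() < 0.25 + 0.7 * p   # signal also grows with p
    # As in The Chameleon, a caught chameleon still wins by guessing the secret.
    return vote_correct and not secret_leaked

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    wins = sum(non_chameleons_win(p) for _ in range(100_000))
    print(f"p={p:.1f}: non-chameleon win rate ≈ {wins / 100_000:.3f}")
```

In this toy model the expected win rate is (0.25 + 0.7p)(1 - p), which peaks near p ≈ 0.32: some revelation is needed to coordinate the vote, but too much hands the chameleon the secret, mirroring the qualitative point that excessive revelation is suboptimal.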
LLMArena: Assessing Capabilities of Large Language Models in Dynamic Multi-Agent Environments
Chen, Junzhe, Hu, Xuming, Liu, Shuodi, Huang, Shiyu, Tu, Wei-Wei, He, Zhaofeng, Wen, Lijie
Recent advancements in large language models (LLMs) have revealed their potential for realizing autonomous agents with human-level intelligence. However, existing benchmarks for evaluating LLM agents either use static datasets, potentially leading to data leakage, or focus only on single-agent scenarios, overlooking the complexities of multi-agent interactions. There is a lack of a benchmark that evaluates the diverse capabilities of LLM agents in multi-agent, dynamic environments. To this end, we introduce LLMArena, a novel and easily extensible framework for evaluating the diverse capabilities of LLMs in multi-agent dynamic environments. LLMArena encompasses seven distinct gaming environments, employing TrueSkill scoring to assess crucial abilities in LLM agents, including spatial reasoning, strategic planning, numerical reasoning, risk assessment, communication, opponent modeling, and team collaboration. We conduct extensive experiments and human evaluations across LLMs of different sizes and types, showing that LLMs still have a long way to go toward becoming fully autonomous agents, especially in opponent modeling and team collaboration. We hope LLMArena can guide future research toward enhancing these capabilities in LLMs, ultimately leading to more sophisticated and practical applications in dynamic, multi-agent settings. The code and data will be made available.
- Europe > Austria > Vienna (0.14)
- North America > United States > Texas (0.05)
- Europe > Middle East (0.04)
- (9 more...)
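The TrueSkill scoring that LLMArena employs can be sketched with the off-the-shelf `trueskill` PyPI package. The agent names and match outcomes below are invented for illustration; only the rating mechanics reflect what the abstract describes.

```python
# Minimal TrueSkill rating loop using the `trueskill` PyPI package; agent
# names and match results are hypothetical.
import trueskill

env = trueskill.TrueSkill(draw_probability=0.0)  # assume games cannot draw
ratings = {name: env.create_rating()
           for name in ("agent-a", "agent-b", "agent-c")}

# Hypothetical (winner, loser) results from head-to-head games
matches = [("agent-a", "agent-c"), ("agent-a", "agent-b"), ("agent-b", "agent-c")]
for winner, loser in matches:
    ratings[winner], ratings[loser] = env.rate_1vs1(ratings[winner], ratings[loser])

# Higher mu = stronger; sigma shrinks as the rating becomes more certain.
for name, r in sorted(ratings.items(), key=lambda kv: kv[1].mu, reverse=True):
    print(f"{name}: mu={r.mu:.2f} sigma={r.sigma:.2f}")
```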
MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration
Xu, Lin, Hu, Zhiyuan, Zhou, Daquan, Ren, Hongyu, Dong, Zhen, Keutzer, Kurt, Ng, See Kiong, Feng, Jiashi
Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing, demonstrating exceptional capabilities in reasoning, tool usage, and memory. As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework that captures their abilities in reasoning, planning, collaboration, and more. This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings, providing quantitative metrics to evaluate their judgment, reasoning, deception, self-awareness, cooperation, coordination, and rationality. We use games such as Chameleon and Undercover, alongside game-theory scenarios such as Cost Sharing, the Multi-player Prisoner's Dilemma, and Public Good, to create diverse testing environments. Our framework is fortified with the Probabilistic Graphical Modeling (PGM) method, enhancing the LLMs' capabilities in navigating complex social and cognitive dimensions. The benchmark evaluates seven multi-agent systems powered by different LLMs, quantitatively highlighting a capability gap of over threefold between the strongest, GPT-4, and the weakest, Llama-2-70B. It also confirms that our PGM enhancement boosts the inherent abilities of all selected models by 50% on average. Our code is released at https://github.com/cathyxl/MAgIC.
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > Italy (0.04)
- Europe > France (0.04)
- (2 more...)
Wordle-solving state of the art: all optimality results so far -- Laurent's notes
Most mathematical questions one could ask about Wordle are settled by now, though a few remain open. I summarize here what is known, as far as I can tell. First, let's clarify a few things about the game: Wordle comes with a dictionary of 12,972 words that the player is allowed to use as guesses. They are essentially all the 5-letter combinations one could reasonably argue are English words. The "secret" word that the player has to discover is also always in that dictionary.
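For concreteness, here is a self-contained sketch of the feedback rule that these optimality analyses build on, including the repeated-letter handling that makes exhaustive search subtle. This is a standard reconstruction of the rule, not code from the notes.

```python
# Wordle's green/yellow/gray feedback rule: greens are assigned first, then
# yellows consume the remaining count of each secret letter.
from collections import Counter

def feedback(guess: str, secret: str) -> str:
    result = ["-"] * 5                      # '-' = gray
    remaining = Counter()
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            result[i] = "G"                 # green: right letter, right spot
        else:
            remaining[s] += 1               # letters still available for yellows
    for i, g in enumerate(guess):
        if result[i] != "G" and remaining[g] > 0:
            result[i] = "Y"                 # yellow: right letter, wrong spot
            remaining[g] -= 1
    return "".join(result)

assert feedback("crane", "crane") == "GGGGG"
assert feedback("allee", "eagle") == "YY-YG"  # repeated-letter handling
```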
Most mathematical questions one could have about Wordle are settled by now, and a few remain open. I summarize here what is known, as far as I can tell. First, let's clarify a few things about the game: Wordle comes with a dictionary of 12972 words that the player is allowed to use as guesses. They are essentially all 5-letter combinations one could reasonably argue are English words. The "secret" word that the player has to discover is also always in that dictionary.